Microbiome analysis with QIIME2

…BRIEF INTRO IN PROGRESS…


A tentative snakemake workflow that defines qiime2 bioinformatics rules in a DAG (directed acyclic graph) format. A detailed interactive snakemake HTML report is available here. Use a wider screen to get a better interactive snakemake report.


Getting started with QIIME 2 pipeline

Get QIIME2 YAML file

wget https://data.qiime2.org/distro/core/qiime2-2023.2-py38-osx-conda.yml

Create qiime2 env and install qiime2

Current YAML file: qiime2-2023.2-py38-osx-conda.yml available

The qiime2 YAML file contains over 500 dependencies. Listed below is just a few QIIME 2 framework dependencies to get the installation started.

name: qiime2202320
channels:
    - qiime2/label/r2023.2
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - q2cli=2023.2.0
    - qiime2=2023.2.0
    - python=3.8.16
    - q2-alignment=2023.2.0
    - q2-composition=2023.2.0
    - q2-cutadapt=2023.2.0
    - q2-dada2=2023.2.0
    - q2-deblur=2023.2.0
    - q2-demux=2023.2.0
    - q2-diversity=2023.2.0
    - q2-diversity-lib=2023.2.0
    - q2-emperor=2023.2.0
    - q2-feature-classifier=2023.2.0
    - q2-feature-table=2023.2.0
    - q2-fragment-insertion=2023.2.0
    - q2-gneiss=2023.2.0
    - q2-longitudinal=2023.2.0
    - q2-metadata=2023.2.0
    - q2-mystery-stew=2023.2.0
    - q2-phylogeny=2023.2.0
    - q2-quality-control=2023.2.0
    - q2-quality-filter=2023.2.0
    - q2-sample-classifier=2023.2.0
    - q2-taxa=2023.2.0
    - q2-types=2023.2.0
    - q2-vsearch=2023.2.0

Installing QIIME2 using a bash script

conda activate base
wget https://data.qiime2.org/distro/core/qiime2-2023.2-py38-osx-conda.yml
conda env create -n qiime2-2023.2 --file qiime2-2023.2-py38-osx-conda.yml
conda activate qiime2-2023.2 
qiime info

Downloading demo data

Demo data from one of QIIME 2[1] tutorials.

mkdir -p resources
mkdir -p resources/reads
mkdir -p resources/references

cd mkdir -p resources/reads

wget \
  -O "sample-metadata.tsv" \
  "https://data.qiime2.org/2023.2/tutorials/atacama-soils/sample_metadata.tsv"

wget \
  -O "emp-paired-end-sequences/forward.fastq.gz" \
  "https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/forward.fastq.gz"

wget \
  -O "emp-paired-end-sequences/reverse.fastq.gz" \
  "https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/reverse.fastq.gz"

wget \
  -O "emp-paired-end-sequences/barcodes.fastq.gz" \
  "https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/barcodes.fastq.gz"

Download a QIIME 2 trained classifer

wget \
  -O "gg-13-8-99-515-806-nb-classifier.qza" \
  "https://data.qiime2.org/2023.2/common/gg-13-8-99-515-806-nb-classifier.qza"

Other classifiers also exist. Check on QIIME2 website for more information.


Classification methods

1. De novo clustering

Sequences are clustered against one another.

Closed-reference clustering

Here the clustering is performed at 99% identity against the Greengenes reference database.

Open-reference clustering

Here the clustering is performed at 99% identity against the Greengenes reference database.


Alignment of representative sequences

  • The MAFFT (Multiple Alignment using Fast Fourier Transform) software provides alignments of the representative sequences.
  • Then we will run alignment mask function to remove poor alignments.


Quality control and feature table with DADA2

QIIME2 uses DADA2[2] tool for:

  • Detecting poor reads in Illumina amplicon sequence data.
  • Denoising.
  • Filtering chimeric sequences.
  • Filtering any phiX reads present in marker gene.
  • Construction of feature table.

Using custom SILVA classifier

  • Silva resources
  • Taxonomy files
  • Below is a simple outline of the steps involved for constructing a QIIME 2 compatible reference from SILVA.
  • Begin by downloading the relevant taxonomy and sequence files from the SILVA.
  • Import these files into QIIME 2.
  • Prepare a fixed-rank taxonomy file.
  • Remove sequences with excessive degenerate bases and homopolymers.
  • Remove sequences that may be too short and/or long. With the option to condition the length filtering based on taxonomy.
  • Dereplicate the sequences and taxonomy.
  • Build our classifier.

Preparing SILVA reference database

How? As described here

QIIME2 diversity analysis

As described here

Alpha diversity metrics

  • Shannon’s diversity index (a quantitative measure of community richness)-
  • Observed OTUs (a qualitative measure of community richness)
  • Faith’s Phylogenetic Diversity (a qualitiative measure of community richness that incorporates phylogenetic relationships between the features)
  • Evenness (or Pielou’s Evenness; a measure of community evenness)

Beta diversity metrics

  • Jaccard distance (a qualitative measure of community dissimilarity)
  • Bray-Curtis distance (a quantitative measure of community dissimilarity)
  • unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
  • weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

Expected output artifacts

  • Saved FeatureTable[Frequency] to: qiime2_process/core-metrics/rarefied_table.qza
  • Saved SampleData[AlphaDiversity] to: qiime2_process/core-metrics/faith_pd_vector.qza
  • Saved SampleData[AlphaDiversity] to: qiime2_process/core-metrics/observed_features_vector.qza
  • Saved SampleData[AlphaDiversity] to: qiime2_process/core-metrics/shannon_vector.qza
  • Saved SampleData[AlphaDiversity] to: qiime2_process/core-metrics/evenness_vector.qza
  • Saved DistanceMatrix to: qiime2_process/core-metrics/unweighted_unifrac_distance_matrix.qza
  • Saved DistanceMatrix to: qiime2_process/core-metrics/weighted_unifrac_distance_matrix.qza
  • Saved DistanceMatrix to: qiime2_process/core-metrics/jaccard_distance_matrix.qza
  • Saved DistanceMatrix to: qiime2_process/core-metrics/bray_curtis_distance_matrix.qza
  • Saved PCoAResults to: qiime2_process/core-metrics/unweighted_unifrac_pcoa_results.qza
  • Saved PCoAResults to: qiime2_process/core-metrics/weighted_unifrac_pcoa_results.qza
  • Saved PCoAResults to: qiime2_process/core-metrics/jaccard_pcoa_results.qza
  • Saved PCoAResults to: qiime2_process/core-metrics/bray_curtis_pcoa_results.qza

Expected output visualizations

  • Saved Visualization to: qiime2_process/core-metrics/unweighted_unifrac_emperor.qzv
  • Saved Visualization to: qiime2_process/core-metrics/weighted_unifrac_emperor.qzv
  • Saved Visualization to: qiime2_process/core-metrics/jaccard_emperor.qzv
  • Saved Visualization to: qiime2_process/core-metrics/bray_curtis_emperor.qzv

Exploring beta diversity metric

    • Qiime2 uses Emperorinteractive tool[3] for exploring beta diversity metrics

Differential abundance

QIIME2 uses ANCOM (Analysis of Composition) to identify differentially abundant taxa.




Citation

Please consider citing the iMAP article[4] if you find any part of the IMAP practical user guides helpful in your microbiome data analysis.


References

[1]
Bolyen, E., Rideout, J. R., Dillon, M. R., Bokulich, N. A., Abnet, C. C., Al-Ghalith, G. A., … Caporaso, J. G. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852–857. https://doi.org/10.1038/s41587-019-0209-9
[2]
Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from illumina amplicon data. Nature Methods, 13(7), 581. https://doi.org/10.1038/nmeth.3869
[3]
Vazquez-Baeza, Y., Pirrung, M., Gonzalez, A., & Knight, R. (2013). EMPeror: A tool for visualizing high-throughput microbial community data. Gigascience, 2, 16. https://doi.org/10.1186/2047-217X-2-16
[4]
Buza, T. M., Tonui, T., Stomeo, F., Tiambo, C., Katani, R., Schilling, M., … Kapur, V. (2019). iMAP: An integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics, 20. https://doi.org/10.1186/S12859-019-2965-4



Appendix

Project main tree

.
├── LICENSE.md
├── README.md
├── config
│   ├── config.yml
│   ├── pbs
│   ├── samples.tsv
│   ├── slurm
│   └── units.tsv
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   ├── README.md
│   ├── logs
│   ├── metadata
│   ├── reads
│   ├── references
│   └── test
├── docs
│   └── primers.md
├── images
│   ├── 16srrna.png
│   ├── bioinformatics.png
│   ├── bkgd.png
│   ├── imap_part05.svg
│   └── smkreport
├── imap-bioinformatics-qiime2.Rproj
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── export.bib
│   ├── imap.bib
│   └── references.bib
├── qiime2_process
│   ├── aligned-rep-seqs.qza
│   ├── alpha-rarefaction.qzv
│   ├── alpha_diversity
│   ├── beta_diversity
│   ├── demux.qza
│   ├── demux.qzv
│   ├── feature-table-dn-99.qza
│   ├── feature-table-genus.qza
│   ├── feature-table.biom
│   ├── feature-table.qza
│   ├── feature-table.qzv
│   ├── feature-taxonomy-table.qzv
│   ├── masked-aligned-rep-seqs.qza
│   ├── new-ref-seqs-or-85.qza
│   ├── rep-seqs-cr-85.qza
│   ├── rep-seqs-dn-99.qza
│   ├── rep-seqs-or-85.qza
│   ├── rep-seqs.qza
│   ├── rep-seqs.qzv
│   ├── rooted-tree
│   ├── rooted-tree.qza
│   ├── sample-metadata.qzv
│   ├── sample-metadata.tsv
│   ├── stats.qza
│   ├── stats.qzv
│   ├── table-cr-85.qza
│   ├── table-or-85.qza
│   ├── taxonomy.qza
│   ├── taxonomy.qzv
│   ├── transformed
│   ├── unmatched-cr-85.qza
│   ├── unrooted-tree
│   └── unrooted-tree.qza
├── report.html
├── resources
│   ├── gg_classifier
│   ├── metadata
│   ├── reads
│   ├── silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza
│   ├── silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza
│   ├── silva_classifier
│   └── test
├── results
│   └── project_tree.txt
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    ├── report
    ├── rules
    └── scripts

33 directories, 53 files


Unlocking working directory.
Building DAG of jobs...
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with

    snakemake --cleanup-metadata <filenames>

To re-generate the files rerun your command with the --rerun-incomplete flag.
Incomplete files:
qiime2_process/transformed/feature-table.tsv
qiime2_process/transformed/taxonomy.tsv
[0407/010746.722119:ERROR:xattr.cc(63)] setxattr org.chromium.crashpad.database.initialized on file /var/folders/_6/mt8ts63j5c35k8kdk17yzq7h0000gn/T/: Operation not permitted (1)
[0407/010746.724319:ERROR:file_io.cc(91)] ReadExactly: expected 8, observed 0
[0407/010746.727307:ERROR:xattr.cc(63)] setxattr org.chromium.crashpad.database.initialized on file /var/folders/_6/mt8ts63j5c35k8kdk17yzq7h0000gn/T/: Operation not permitted (1)
[0407/010748.548476:ERROR:command_buffer_proxy_impl.cc(125)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[0407/010753.313996:INFO:headless_shell.cc(653)] Written to file /Users/tmbuza/Dropbox/MICROBIOME/imap-bioinformatics-qiime2/images/smkreport/screenshot.png.
Created 1 file(s):
    /Users/tmbuza/Dropbox/MICROBIOME/imap-bioinformatics-qiime2/images/smkreport/screenshot.png


Troubleshooting of FAQs

  1. Question
    • Answer
  2. Question
    • Answer